1 Introduction

In this report, I describe how we can use tools available in R to acquire and manipulate data from the web. This is particularly useful if we are interested in performing any sort of text analysis using webpages as our source. I intend to use the tools and methods introduced here to ultimately perform a sentiment analysis of user reviews for REDS Midtown Tavern in Toronto, Ontario. The restaurant’s website is located here: www.redsmidtowntavern.com. This report is divided into two primary sections: data gathering and data cleaning. In the first section, I discuss the methods and tools I used to gather customer review data for REDS from popular review websites. In the second section, I go into detail about the process I used to clean the raw data obtained from these websites.

Before we dive in, let’s load some libraries and set the working directory.

# Start from a clean workspace
rm(list = ls())

# Set the working directory to the project folder
CapstoneDir <- "/Users/Antoine/Documents/Work/DataScience/Springboard/FoundationsofDataScience/CapstoneProject"
setwd(CapstoneDir)

#Load required libraries
suppressMessages(library(rvest))
suppressMessages(library(dplyr))
library(tidyr)
suppressMessages(library(readr))

2 Data Gathering

As mentioned in the Introduction, the aim of this project is to perform a sentiment analysis of REDS Midtown Tavern, based on available customer reviews. It is well known that there is a wealth of such information on the internet, contained on review and travel websites. For the purpose of this investigation, I have focussed my efforts on the following websites: https://www.yelp.com/toronto, https://www.opentable.com/toronto-restaurants, https://www.tripadvisor.ca/, and https://www.zomato.com/toronto. Though this data is readily available for us to see and read, we would like to acquire it in a more structured way, so as to facilitate processing and analysis. This can be done fairly simply using a variety of tools available online. More specifically, I gathered the review data from the aforementioned websites by using the rvest package in R (https://cran.r-project.org/web/packages/rvest/index.html) and the web scraping tool, SelectorGadget (http://selectorgadget.com/).

2.1 SelectorGadget and CSS Selectors

Before I begin discussing the web scraping process in depth, I want to spend some time discussing SelectorGadget. SelectorGadget is a wonderful open-source tool that allows us to obtain CSS selectors on a website by clicking on items of interest. CSS selectors are what allow us to gather data from the web by using packages such as rvest. The difficulty arises in identifying which selectors are associated with which content on a webpage. Traditionally, one might attempt to identify a selector for specific content by scouring the HTML source code for a website. Though it is doable, this isn’t exactly effortless. SelectorGadget greatly improves this process by providing a point-and-click interface between the webpage content and the underlying selector.

Let’s elaborate on this using a relevant example. Here is the Yelp webpage for REDS Midtown Tavern: https://www.yelp.ca/biz/reds-midtown-tavern-toronto-2. We want to scrape this website for the customer reviews, which can be obtained by scrolling slightly down the page:

By running the SelectorGadget, we can select the first review to start identifying the corresponding CSS Selector:

We see that SelectorGadget has returned the CSS selector “p”. This selector includes the customer reviews, which is what we want, but it also includes additional content, such as “Was this review …?” and the text under the heading “From the business”. We don’t need these elements of the webpage. The content that we have selected has been coloured in green, while the other content that matches that selector has been highlighted in yellow. This is the initial stage in our filtering process. We can now choose to add content to our filter, by clicking on content that is not yet highlighted, or we can remove content from our filter by clicking on content that has been highlighted. In this case, we want to click on the text below “From the business” to remove it from our selection.

We see that content that had previously been highlighted but is now de-selected becomes red. We also notice that the text “Was this review…?” is now no longer included in our selection. It appears that the CSS selector “.review-content p” is associated with the review content on this webpage. We can verify this by noting that SelectorGadget has found 20 items matching this selector, which corresponds to the number of reviews on this page. Now that we have the selector that we want, we can import the data into R. This will be done in detail below.

2.1.1 CSS Selectors in Detail

I’d like to spend some time discussing how to read CSS selectors. In essence, CSS selectors allow us to identify given “blocks” of content within an HTML source file. One excellent website that allows you to get a feel for CSS selectors and how to work with them is https://flukeout.github.io/.

An HTML file is built up of different blocks of content, which are identified by their type. For example, <p> represents a paragraph block, <a> a hyperlink, <h1> a heading, and so on. You can practice working with HTML here: https://www.w3schools.com/html/default.asp

The most basic CSS selector is the type selector. To obtain all content that is contained within a paragraph block, simply use the selector “p”. In fact, this is what happened in the SelectorGadget example above, where the customer review content was contained in a paragraph block, but so was other extraneous content. Using the selector p identified all of this content.
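To make the type selector concrete, here is a minimal sketch using rvest on a small inline HTML fragment. The fragment is made up for illustration and is not taken from Yelp. I am also assuming rvest 1.0.0 or later, where html_elements() is available; older versions use html_nodes() in the same way.

```r
library(rvest)

# Hypothetical page fragment (not real Yelp markup): one heading, two paragraphs
page <- read_html('
  <html><body>
    <h1>Reviews</h1>
    <p>Great burger.</p>
    <p>Was this review helpful?</p>
  </body></html>')

# The type selector "p" matches every paragraph block, wanted or not
paragraphs <- html_text(html_elements(page, "p"))
print(paragraphs)
```

Both paragraphs come back, which mirrors the problem described above: the type selector alone cannot separate review text from other paragraph content.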

In order to distinguish between blocks of the same type, HTML types can be associated with an ID or with a class. E.g. <div id="title"> or <div class="content">. These are both <div> blocks, but one has an ID called “title” and the other has a class called “content”. Content can be extracted from a webpage by selecting it based on ID or class. The CSS selector for an ID is a hash sign, #. E.g. if there exists <div id="title">, we could select the element by using the selector #title. This will select all content with id="title". The selector for a class is a period, “.”. E.g. we can select <div class="content"> using .content.
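The ID and class selectors can be tried out in the same way. The sketch below uses a made-up HTML fragment built around the <div id="title"> and <div class="content"> blocks mentioned above; the text inside the blocks is an assumption for illustration, and html_elements() again assumes rvest 1.0.0 or later.

```r
library(rvest)

# Hypothetical fragment: one block with an ID, two blocks sharing a class
page <- read_html('
  <html><body>
    <div id="title">REDS Midtown Tavern</div>
    <div class="content">First block</div>
    <div class="content">Second block</div>
  </body></html>')

# "#title" selects the element whose id is "title"
title_text <- html_text(html_elements(page, "#title"))

# ".content" selects every element whose class is "content"
content_text <- html_text(html_elements(page, ".content"))

print(title_text)
print(content_text)
```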

Now that we have these basic selectors, we can combine them to access more specific content. One such way is using the descendant selector. This allows us to select content that is nested within a content block. The descendant selector just consists of writing two selectors separated by a space. E.g. given <div id="title"><p>Text</p></div>, we could select the nested paragraph with the selector div p. We can also use the descendant selector with the ID or class selectors. E.g. #title p.
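Here is a short sketch of the descendant selector at work. The HTML fragment is hypothetical: it contains one paragraph nested inside the <div id="title"> block and one paragraph outside of it, so that we can see the difference between p and #title p. As before, html_elements() assumes rvest 1.0.0 or later.

```r
library(rvest)

# Hypothetical fragment: one paragraph inside the div, one outside it
page <- read_html('
  <html><body>
    <div id="title"><p>Text</p></div>
    <p>Paragraph outside the div</p>
  </body></html>')

# The type selector alone matches both paragraphs
all_p <- html_text(html_elements(page, "p"))

# The descendant selector keeps only paragraphs nested inside #title
nested_p <- html_text(html_elements(page, "#title p"))

print(all_p)
print(nested_p)
```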

In order to obtain the customer reviews from Yelp in the above example, we used the selector .review-content p. We can see that this uses a descendant selector with the class and type selectors. We are selecting the paragraph blocks, <p>, within a block with class="review-content".
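The sketch below imitates this situation on a small mock page. Only the class name review-content comes from the Yelp example; the other class names (review-footer, from-biz-owner) and all of the text are invented for illustration and should not be read as Yelp's actual markup. html_elements() assumes rvest 1.0.0 or later.

```r
library(rvest)

# Mock page: only "review-content" is taken from the example above;
# the other class names and all text are hypothetical
page <- read_html('
  <html><body>
    <div class="review-content"><p>Loved the patio.</p></div>
    <div class="review-content"><p>Service was slow.</p></div>
    <div class="review-footer"><p>Was this review helpful?</p></div>
    <div class="from-biz-owner"><p>Welcome to REDS.</p></div>
  </body></html>')

# Paragraphs nested inside blocks with class "review-content" only
reviews <- html_text(html_elements(page, ".review-content p"))
print(reviews)

# Counting the matches is a quick sanity check, much like the
# match count that SelectorGadget reports
print(length(reviews))
```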

Specific class content can be selected by combining the class selector with the type selector, in the following way: type.class. E.g. if we have a <div class="title"> and a block of another type with the same class, such as <p class="title">, we can select the <div> block specifically using div.title.
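As a small illustration of the type.class form, the sketch below uses a made-up fragment in which a <div> block and a <p> block share the class “title”. The choice of <p> as the second block and the text inside both blocks are assumptions for the sake of the example. html_elements() assumes rvest 1.0.0 or later.

```r
library(rvest)

# Hypothetical fragment: two blocks of different types sharing one class
page <- read_html('
  <html><body>
    <div class="title">Heading block</div>
    <p class="title">Paragraph with the same class</p>
  </body></html>')

# The class selector alone matches both blocks
both <- html_text(html_elements(page, ".title"))

# Prefixing the type restricts the match to the div block
div_only <- html_text(html_elements(page, "div.title"))

print(both)
print(div_only)
```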

This covers the basics of CSS selectors. This should give you some context for what SelectorGadget is returning when you click on webpage content.

2.2 Importing Data with rvest

3 Data Cleaning